AITopics | opponent strategy

Partial monitoring is a general model for online learning with limited feedback: a learner chooses actions in a sequential manner while an opponent chooses outcomes. In every round, the learner suffers some loss and receives some feedback based on the action and the outcome.

algorithm, opponent, outcome distribution, (13 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland (0.05)
North America > United States (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Industry:

Leisure & Entertainment > Games (0.48)
Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Communications > Social Media (0.68)
Information Technology > Game Theory (0.68)

Add feedback

Strategy-Augmented Planning for Large Language Models via Opponent Exploitation

Xu, Shuai, Cui, Sijia, Wang, Yanna, Xu, Bo, Wang, Qi

arXiv.org Artificial IntelligenceJun-3-2025

Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt context that incorporates opponent descriptions, while these approaches are limited to scenarios where LLMs possess adequate domain expertise. To address that, we introduce a two-stage Strategy-Augmented Planning (SAP) framework that significantly enhances the opponent exploitation capabilities of LLM-based agents by utilizing a critical component, the Strategy Evaluation Network (SEN). Specifically, in the offline stage, we construct an explicit strategy space and subsequently collect strategy-outcome pair data for training the SEN network. During the online phase, SAP dynamically recognizes the opponent's strategies and greedily exploits them by searching best response strategy on the well-trained SEN, finally translating strategy to a course of actions by carefully designed prompts. Experimental results show that SAP exhibits robust generalization capabilities, allowing it to perform effectively not only against previously encountered opponent strategies but also against novel, unseen strategies. In the MicroRTS environment, SAP achieves a $85.35\%$ performance improvement over baseline methods and matches the competitiveness of reinforcement learning approaches against state-of-the-art (SOTA) rule-based AI. Our code is available at https://github.com/hsushuai/SAP.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.08459

Country:

Asia > China (0.29)
North America > United States (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Efficient Partial Monitoring with Prior Information

Hastagiri P. Vanchinathan, Gábor Bartók, Andreas Krause

Neural Information Processing SystemsFeb-8-2025, 16:45:55 GMT

Partial monitoring is a general model for online learning with limited feedback: a learner chooses actions in a sequential manner while an opponent chooses outcomes. In every round, the learner suffers some loss and receives some feedback based on the action and the outcome. The goal of the learner is to minimize her cumulative loss. Applications range from dynamic pricing to label-efficient prediction to dueling bandits. In this paper, we assume that we are given some prior information about the distribution based on which the opponent generates the outcomes. We propose BPM, a family of new efficient algorithms whose core is to track the outcome distribution with an ellipsoid centered around the estimated distribution. We show that our algorithm provably enjoys near-optimal regret rate for locally observable partial-monitoring problems against stochastic opponents. As demonstrated with experiments on synthetic as well as real-world data, the algorithm outperforms previous approaches, even for very uninformed priors, with an order of magnitude smaller regret and lower running time.

artificial intelligence, machine learning, outcome distribution, (15 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Industry:

Leisure & Entertainment > Games (0.48)
Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Game Theory (0.68)
Information Technology > Communications > Social Media (0.68)

Add feedback

SMAC-Hard: Enabling Mixed Opponent Strategy Script and Self-play on SMAC

Deng, Yue, Yu, Yan, Ma, Weiyu, Wang, Zirui, Zhu, Wenhui, Zhao, Jian, Zhang, Yin

arXiv.org Artificial IntelligenceDec-24-2024

The availability of challenging simulation environments is pivotal for advancing the field of Multi-Agent Reinforcement Learning (MARL). In cooperative MARL settings, the StarCraft Multi-Agent Challenge (SMAC) has gained prominence as a benchmark for algorithms following centralized training with decentralized execution paradigm. However, with continual advancements in SMAC, many algorithms now exhibit near-optimal performance, complicating the evaluation of their true effectiveness. To alleviate this problem, in this work, we highlight a critical issue: the default opponent policy in these environments lacks sufficient diversity, leading MARL algorithms to overfit and exploit unintended vulnerabilities rather than learning robust strategies. To overcome these limitations, we propose SMAC-HARD, a novel benchmark designed to enhance training robustness and evaluation comprehensiveness. SMAC-HARD supports customizable opponent strategies, randomization of adversarial policies, and interfaces for MARL self-play, enabling agents to generalize to varying opponent behaviors and improve model stability. Furthermore, we introduce a black-box testing framework wherein agents are trained without exposure to the edited opponent scripts but are tested against these scripts to evaluate the policy coverage and adaptability of MARL algorithms. We conduct extensive evaluations of widely used and state-of-the-art algorithms on SMAC-HARD, revealing the substantial challenges posed by edited and mixed strategy opponents. Additionally, the black-box strategy tests illustrate the difficulty of transferring learned policies to unseen adversaries. We envision SMAC-HARD as a critical step toward benchmarking the next generation of MARL algorithms, fostering progress in self-play methods for multi-agent systems. Our code is available at https://github.com/devindeng94/smac-hard.

agent, algorithm, marl algorithm, (9 more...)

arXiv.org Artificial Intelligence

2412.17707

Country:

Asia > Middle East > Jordan (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

In-Context Exploiter for Extensive-Form Games

Li, Shuxin, Yang, Chang, Zhang, Youzhi, Li, Pengdeng, Wang, Xinrun, Huang, Xiao, Chan, Hau, An, Bo

arXiv.org Artificial IntelligenceAug-10-2024

Nash equilibrium (NE) is a widely adopted solution concept in game theory due to its stability property. However, we observe that the NE strategy might not always yield the best results, especially against opponents who do not adhere to NE strategies. Based on this observation, we pose a new game-solving question: Can we learn a model that can exploit any, even NE, opponent to maximize their own utility? In this work, we make the first attempt to investigate this problem through in-context learning. Specifically, we introduce a novel method, In-Context Exploiter (ICE), to train a single model that can act as any player in the game and adaptively exploit opponents entirely by in-context learning. Our ICE algorithm involves generating diverse opponent strategies, collecting interactive history training data by a reinforcement learning algorithm, and training a transformer-based agent within a well-designed curriculum learning framework. Finally, comprehensive experimental results validate the effectiveness of our ICE algorithm, showcasing its in-context learning ability to exploit any unknown opponent, thereby positively answering our initial game-solving question.

algorithm, opponent, opponent strategy, (16 more...)

arXiv.org Artificial Intelligence

2408.05575

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Texas (0.04)
North America > United States > Nebraska (0.04)
(2 more...)

Genre: Research Report > New Finding (0.93)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Efficient Partial Monitoring with Prior Information

Neural Information Processing SystemsMar-13-2024, 06:01:37 GMT

Partial monitoring is a general model for online learning with limited feedback: a learner chooses actions in a sequential manner while an opponent chooses outcomes. In every round, the learner suffers some loss and receives some feedback based on the action and the outcome. The goal of the learner is to minimize her cumulative loss. Applications range from dynamic pricing to label-efficient prediction to dueling bandits. In this paper, we assume that we are given some prior information about the distribution based on which the opponent generates the outcomes. We propose BPM, a family of new efficient algorithms whose core is to track the outcome distribution with an ellipsoid centered around the estimated distribution. We show that our algorithm provably enjoys near-optimal regret rate for locally observable partial-monitoring problems against stochastic opponents. As demonstrated with experiments on synthetic as well as real-world data, the algorithm outperforms previous approaches, even for very uninformed priors, with an order of magnitude smaller regret and lower running time.

algorithm, opponent, outcome distribution, (11 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)

Industry:

Leisure & Entertainment > Games (0.48)
Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Game Theory (0.68)
Information Technology > Communications > Social Media (0.68)

Add feedback

Influencing Towards Stable Multi-Agent Interactions

Wang, Woodrow Z., Shih, Andy, Xie, Annie, Sadigh, Dorsa

arXiv.org Artificial IntelligenceOct-5-2021

Learning in multi-agent environments is difficult due to the non-stationarity introduced by an opponent's or partner's changing behaviors. Instead of reactively adapting to the other agent's (opponent or partner) behavior, we propose an algorithm to proactively influence the other agent's strategy to stabilize -- which can restrain the non-stationarity caused by the other agent. We learn a low-dimensional latent representation of the other agent's strategy and the dynamics of how the latent strategy evolves with respect to our robot's behavior. With this learned dynamics model, we can define an unsupervised stability reward to train our robot to deliberately influence the other agent to stabilize towards a single strategy. We demonstrate the effectiveness of stabilizing in improving efficiency of maximizing the task reward in a variety of simulated environments, including autonomous driving, emergent communication, and robotic manipulation. We show qualitative results on our website: https://sites.google.com/view/stable-marl/.

agent, ego agent, opponent, (14 more...)

arXiv.org Artificial Intelligence

2110.08229

Country: